A variety of coordination primitives have been proposed for memory-sharing
MIMD parallel processors. The most familiar of these primitives are
probably test-and-set and swap. In this report we show how a simple
modification of the fetch-and-add coordination primitive [1], [3], recently
proposed for the "NYU Ultracomputer" [2], yields a hardware implementation
technique that may also be used for test-and-set and swap. This technique
enables one to support a wide range of operations on any network-based
multiprocessor.

Since in a parallel processor the relative cost of serial bottlenecks
rises linearly with the number of processing elements (PEs), users of future
ultra-large-scale parallel machines will be anxious to avoid the use of
critical (and hence necessarily serial) code sections, even if these sections
are short enough to be entirely acceptable in current practice.
Bottleneck-free fetch-and-add-based implementations of several basic operating
system primitives are presented in [1] and [3]. These algorithms are useful
because the fetch-and-add hardware design described below has the crucial
property that multiple concurrent operations directed at the same shared
memory cell require essentially the same time as just one such operation.
(This property was also satisfied by previous implementations.)

We first review the definitions of the primitives discussed above and
describe the basic (network-based) design of the NYU Ultracomputer. We then
present the new fetch-and-add primitive and its implementation. Finally, we
show how other primitives may be implemented in the same way.

In the definitions that follow we use the brackets { and } to group
statements that must be executed indivisibly. We define test-and-set to be a
value-returning procedure operating on a global Boolean variable.

TestAndSet(G)
{ Temp ← G
  G ← TRUE }
RETURN Temp

The  swap  operation  exchanges  the  values  of  a  local  variable  L  (which  specifies 
a  processor  register  or  stack  location)  and  a  global  variable  G. 

Swap(L,G)
{ Temp ← L
  L ← G
  G ← Temp }

Our variant of the fetch-and-add operation is a value-returning procedure
operating on a global integer variable G and a local integer variable L.

FetchAndAdd(G,L)
{ Temp ← G
  G ← G + L }
RETURN Temp

We note that it is perhaps more natural to have fetch-and-add return the
"new" value of G, i.e.

AltFetchAndAdd(G,L)
{ G ← G + L
  RETURN G }

In fact this is the definition given in [1], [2], and [3], and until very
recently was the version of fetch-and-add we planned to implement [2]. (In
these earlier reports this operation was called replace-add, but we believe
that the present terminology is more descriptive.) Of course

AltFetchAndAdd(G,L)      is   equivalent   to     FetchAndAdd(G,L)   +  L 
and  FetchAndAdd(G,L)      is   equivalent   to     AltFetchAndAdd(G,L)   -  L 

so  there  is  little  reason  to  prefer  one  definition  over  the  other.  However, 
if  for  the  addition  operation  we  substitute  a  noninvertible  operation,  say 
maximum,    then  we   see   that 

AltFetchAndMax(G,L)      is equivalent to     Max(FetchAndMax(G,L),L)

but we cannot obtain FetchAndMax from AltFetchAndMax, since the returned
maximum does not determine the previous value of G.
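
The asymmetry can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the helper name `alt_result` and the sample values are assumptions made for this example and do not appear in the report.

```python
def alt_result(op, old_g, local):
    """Value the 'new value' variant would return when G holds old_g."""
    return op(old_g, local)

def add(x, y):
    return x + y

# Addition is invertible: subtracting L from the new value recovers the old G.
recovered = alt_result(add, 2, 9) - 9

# Maximum is not invertible: old values 2 and 7 both yield 9 when L = 9,
# so the new value alone cannot reveal which old value G held.
new_values = {old: alt_result(max, old, 9) for old in (2, 7)}
```

Here `recovered` equals the old value 2, while both entries of `new_values` are 9.
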

More generally, for any associative binary operation φ defined on a set S,
we define the value-returning procedure fetch-and-φ operating on a global
S-ranging variable G and a local S-ranging variable L.

FetchAndφ(G,L)
{ Temp ← G
  G ← φ(G,L) }
RETURN Temp
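
As a software analogy only, the definition can be sketched in Python with a lock standing in for the indivisibility brackets. The class name `SharedCell` and its method are assumptions made for this sketch; the report describes a hardware mechanism, not this code.

```python
import threading

class SharedCell:
    """A global variable G; the lock plays the role of the { } brackets."""

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def fetch_and_phi(self, phi, local):
        """Return the old value of G and store phi(G, local) indivisibly."""
        with self._lock:
            temp = self.value
            self.value = phi(temp, local)
        return temp

g = SharedCell(0)
first = g.fetch_and_phi(lambda x, y: x + y, 5)   # phi is addition
second = g.fetch_and_phi(max, 3)                 # phi is maximum
```

The first call returns 0 and leaves 5 in G; the second returns 5 and leaves G unchanged, since max(5, 3) = 5.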

We now review the basic design of the NYU Ultracomputer. As illustrated
in figures 1 and 2, this machine consists of autonomous processing elements
(PEs) that access a like number of memory modules (MMs) via an Ω
interconnection network. The processor network interfaces (PNIs) are
essentially caches and not germane to the present discussion. The memory
network interfaces (MNIs) contain the adders used to update the shared
variables as required in the fetch-and-add operation.

Before  describing  the  switch  architecture  permitting  rapid  execution  of 
concurrent  fetch-and-adds  referencing  a  common  global  variable,  we  illustrate 
the  required  behavior.  Suppose  that  two  PEs  concurrently  execute 
FetchAndAdd(G,1), that G is initially 0, and that no other concurrent
fetch-and-adds on G occur. Then, when the two concurrent fetch-and-adds
terminate, G contains the value 2, one PE receives the value 0 (the PE
"chosen" to be indivisibly executed first), and the other PE receives the
value 1.
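
The required behavior can be mimicked in software with two threads standing in for the two PEs. This is a minimal sketch under the assumption that a lock is an acceptable model of indivisible execution; the names `fetch_and_add`, `pe`, and `received` are illustrative.

```python
import threading

G = 0
lock = threading.Lock()   # stands in for the indivisibility brackets

def fetch_and_add(increment):
    """Return the old value of G and add increment to it indivisibly."""
    global G
    with lock:
        old = G
        G = old + increment
    return old

received = [None, None]   # one result slot per PE

def pe(index):
    received[index] = fetch_and_add(1)

threads = [threading.Thread(target=pe, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Which PE receives 0 depends on scheduling, but the two received values are always 0 and 1 in some order, and G ends at 2.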

When  concurrent  fetch-and-adds  referencing  a  common  global  variable  meet 
at  a  switch,  they  may  be  combined  as  illustrated  below.  When  this  combined 
fetch-and-add  returns  a  value  to  the  switch  from  memory  the  two  original 
operations  are  satisfied.  See  Figure  3  for  the  time  sequence  of  the  switch's 
operation,  which  is  summarized  in  Figure  4.  (Note  that  although  we  use  the 
term  "switch",  this  device  contains  adders  and  memory  and  is  thus  comparable 
in  complexity   to  a   simple   microprocessor.) 

If  a  combined  request  encounters  another  fetch-and-add  referencing  the 
same  variable,  the  combined  requests  can  themselves  be  combined  and  the 
associative  law  guarantees  that  correct  actions  are  performed.  Figure  5 
illustrates this for four PEs executing FetchAndAdd(G,1), FetchAndAdd(G,10),
FetchAndAdd(G,100), and FetchAndAdd(G,1000), with G initially zero.
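
The combining rule can be simulated in a few lines. The sketch below assumes a two-level tree in which the first two and the last two requests are paired before the pairs meet; the actual pairing in Figure 5 may differ, and the names `combine`, `pe_request`, and the PE labels are hypothetical.

```python
def combine(req_a, req_b):
    """Merge two fetch-and-add requests aimed at the same cell.

    A request is (increment, reply), where reply(v) delivers the value v
    to the requester. The switch forwards the summed increment and keeps
    the first increment so that it can split the reply on the way back.
    """
    inc_a, reply_a = req_a
    inc_b, reply_b = req_b

    def split_reply(v):
        reply_a(v)            # as if req_a had been executed first
        reply_b(v + inc_a)    # as if req_b had followed immediately

    return (inc_a + inc_b, split_reply)

memory = {"G": 0}
results = {}

def pe_request(name, increment):
    return (increment, lambda v: results.__setitem__(name, v))

left = combine(pe_request("PE0", 1), pe_request("PE1", 10))
right = combine(pe_request("PE2", 100), pe_request("PE3", 1000))
total, reply = combine(left, right)

# Memory performs a single fetch-and-add carrying the total increment.
old = memory["G"]
memory["G"] = old + total
reply(old)
```

With this pairing the PEs receive 0, 1, 11, and 111, and G ends at 1111, exactly as if the four operations had been executed serially in that order.
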

For any associative operator φ, it is straightforward to generalize the
fetch-and-add switch shown in Figure 4 to a "fetch-and-φ switch" shown in
Figure 6, and it is easy to see that the associativity of φ guarantees that
combined fetch-and-φ operations can themselves be combined as illustrated in
Figure 5 for fetch-and-add. We also note that analogous switches can be used
with other networks (Banyan, etc.) provided that the reply from an MM to a PE
traverses the same path as the original request (but in reverse order).

We now show how each of the coordination primitives discussed above can
be obtained by substituting a specific associative operation for φ. Of
course, letting φ(X,Y) = X+Y gives fetch-and-add; moreover test-and-set is
just a fetch-and-OR with TRUE, i.e.

TestAndSet(G)      is   equivalent   to     FetchAndOR(G,TRUE) 

Finally, a swap operation can be effected by using the projection operator
Π₂(X,Y) = Y, i.e.

Swap(L,G)      is equivalent to     L ← FetchAndΠ₂(G,L).

We conclude this report by showing that fetch-and-φ may be used as the
sole primitive for accessing global memory. Specifically, we show how to
obtain the familiar load and store operations as degenerate examples of
fetch-and-φ. To load the local variable L from the global variable G one
simply executes

L ← FetchAndΠ₁(G,*)

where Π₁(X,Y) = X and the value of * is immaterial (and thus need not be
transmitted). Similarly, to store the value of the local variable L into the
global variable G one executes

* ← FetchAndΠ₂(G,L)

where Π₂(X,Y) = Y and the * indicates that the value returned is not used and
thus again need not be transmitted.
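
The two degenerate cases can be sketched as follows, again assuming indivisible calls on a toy memory. The names `load` and `store` and the use of `None` for the immaterial operand are illustrative choices, not part of the report.

```python
def fetch_and_phi(mem, name, phi, local):
    """Return the old value of mem[name] and store phi(old, local)."""
    old = mem[name]
    mem[name] = phi(old, local)
    return old

def load(mem, name):
    # First projection: G is rewritten with its own value, so it is unchanged.
    # The operand is immaterial; None is passed as a placeholder.
    return fetch_and_phi(mem, name, lambda x, y: x, None)

def store(mem, name, local):
    # Second projection: G receives L; the returned old value is discarded.
    fetch_and_phi(mem, name, lambda x, y: y, local)

mem = {"G": 42}
L = load(mem, "G")
store(mem, "G", L + 1)
```

After the load, L is 42 and G is untouched; after the store, G holds 43.
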

Figure 1. Block Diagram.

Figure 3.
(FAA abbreviates FetchAndAdd)